This tutorial is for pyVDJ v0.1.2. The package is available here: https://github.com/veghp/pyVDJ
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc # v1.4.3
import pyvdj
import upsetplot
from matplotlib_venn import venn2
from matplotlib_venn import venn3
sc.settings.set_figure_params(dpi=100)
adata = sc.read_h5ad('data/anndata.h5ad')
adata.shape
Our example dataset contains T lymphocytes:
sc.pl.umap(adata, color=['CD8A', 'CCL5', 'GZMH'], use_raw=False)
The Th and the CD8+ Tc are clearly separated. A subcluster of Tc expresses CCL5, granzymes and other effector proteins. We use this dataset to demonstrate the functions of pyVDJ. The dataset consists of three donors, and two conditions (control and gain-of-function).
sc.pl.umap(adata, color=['donor', 'status'], use_raw=False)
To load in the VDJ data, we construct filepaths and a dictionary linking files to sample names. The easiest way is to prepare a manifest that lists the GEX sample names next to the V(D)J 10x directory names in a csv file. Each TCR-enriched VDJ csv datafile has one matching GEX sample (corresponding to one 10x channel). (If both TCR and BCR-enrichment was performed, then we have two VDJ csv datafiles for the GEX sample (10x channel). pyVDJ currently does not handle this simultaneously. It is recommended to perform the analysis separetely for the TCR and BCR datasets.)
manifest = pd.read_csv('sample_manifest.csv')
manifest = manifest[manifest['metadata'].isin(adata.obs['metadata'].unique())] # keep the ones in adata
#~ manifest = manifest[manifest['VDJ'].notnull()] # can remove samples which have no VDJ
# Construct paths to VDJ csv files:
paths = 'data/VDJ/' + manifest['VDJ'] + '/filtered_contig_annotations.csv'
samples = dict(zip(paths, manifest['metadata']))
samples
The AnnData object must contain a metadata column (e.g. adata.obs['vdj_obs']) of the following format: cellbarcode + '_' + samplename. This can be constructed from the cell barcodes and sample names (provided that we have sample annotation):
#~ cellnames = adata.obs_names
#~ cellbarcode = cellnames.str.split("-").str[:2].str.join("-") # cell barcode part + '-1'
#~ adata.obs['vdj_obs'] = cellbarcode.astype(str) + "_" + adata.obs['Sample'].astype(str)
We then read V(D)J data into the AnnData object and create annotations. Note that values in the filtered_contig_annotations.csv files cannot be directly added as annotations, because one cell may have 0 to n values per entry. It will be stored in adata.uns and annotation will be generated separately.
adata = pyvdj.load_vdj(samples, adata, obs_col='vdj_obs', cellranger=2)
This loaded 10x V(D)J sequencing data (i.e. filtered_contig_annotations.csv files) into adata.uns['pyvdj']. obs_col specifies the annotation column for cellnames, as prepared above. For Cell Ranger version 3 data, set the cellranger parameter to 3.
For definitions of some words (clone, clonotype etc) used in the next section, see https://github.com/veghp/pyVDJ.
adata = pyvdj.add_obs(adata, obs=['clonotype', 'is_clone'])
This generates annotation in adata.obs. Now we can plot V(D)J properties:
sc.settings.set_figure_params(dpi=70)
sc.pl.umap(adata, color=['vdj_has_vdjdata', 'vdj_is_clone'])
allcell = len(adata.obs['vdj_has_vdjdata'])
vdjcell = sum(adata.obs['vdj_has_vdjdata'])
print('We have %d cells with VDJ data, out of %d cells.' % (vdjcell, allcell))
adata = pyvdj.add_obs(adata, obs=['all_productive', 'any_productive'])
sc.pl.umap(adata, color=['vdj_all_productive', 'vdj_any_productive'])
Note that upon plotting, the pandas Series in the adata.obs dataframe are turned into a categorical type by scanpy. This can important, as you see below we compare with 'True' instead of True:
n_prod = sum(adata.obs['vdj_all_productive'] == 'True')
print('There are %d cells which have only productive chains.' % (n_prod))
n_prod = sum(adata.obs['vdj_any_productive'] == 'True')
print('There are %d cells with at least one productive chain.' % (n_prod))
repl_values = {
'True': 'False',
'False': 'True',
}
adata.obs['vdj_none_productive'] = adata.obs['vdj_any_productive'].replace(to_replace=repl_values, inplace=False)
n_prod = sum(adata.obs['vdj_none_productive'] == 'True')
print('There are %d cells with no productive chains.' % (n_prod))
sc.pl.umap(adata, color=['vdj_none_productive'])
The following command adds one metadata column for each type of chain found in the V(D)J data:
adata = pyvdj.add_obs(adata, obs=['chains'])
sc.pl.umap(adata, color=['vdj_chain_TRA', 'vdj_chain_TRG', 'vdj_chain_IGH'])
Most of the cells have alpha / beta chains. This was observed in other 10x datasets too (not shown).
adata = pyvdj.add_obs(adata, obs=['genes']) # one annotation for each gene
sc.pl.umap(adata, color=['vdj_constant_TRBC1', 'vdj_constant_TRDC', 'vdj_constant_TRGC1'])
Similarly, we can add annotation for each V and each J gene:
adata = pyvdj.add_obs(adata, obs=['v_genes', 'j_genes'])
We can flag which cells are members of a clonotype with not fewer than n clones:
adata = pyvdj.add_gt_n(adata, n=40)
sc.pl.umap(adata, color='vdj_clonotype_n_40')
We can see that the most expanded clonotypes are in the CCL5 (granzyme, etc)-expressing group.
adata = pyvdj.add_obs(adata, obs=['clone_count'])
sc.pl.umap(adata, color='vdj_clone_count')
The plot shows the number of clones in each cell's clonotype, and provides an alternative view of expanded clonotypes. Let's retrieve the CDR3 amino acid sequences of the above clonotypes:
clonotypes = adata.obs['vdj_clonotype_gt_40'].unique()[2:].tolist()
cdr3_dict = pyvdj.get_spec(adata, clonotypes)
cdr3_dict
This can be used to find their specificity in CDR3 databases, such as VDJdb.
Now we obtain statistics for each specified AnnData metadata category:
meta = 'metadata'
#~ adata.obs['metadata'] # one category for each 10x channel
adata = pyvdj.stats(adata, meta)
If ran the first time, this will add a dictionary in adata.uns['pyvdj']['cdr3'], which stores a set of productive CDR3s for each cell:
adata.uns['pyvdj']['cdr3']['cdr3sets']
adata.uns['pyvdj']['cdr3']['cdr3_codes'] # and a dictionary of {code: cdr3 set}
adata.uns['pyvdj']['cdr3']['cdr3_codes_rev'] # and the reverse dictionary of {cdr3set: code}
This was used to calculate the statistics that was added as a dictionary (adata.uns['pyvdj']['stats'][meta]):
adata.uns['pyvdj']['stats'][meta].keys()
See the readme for details on output.
# We define a plotting function. For easy customization, it is not included in the package.
def plot_cells(stats_dict):
hasvdj = stats_dict['cells'][1]
novdj = stats_dict['cells'][0] - hasvdj
bars = np.add(hasvdj, novdj).tolist()
x_axis = list(range(0, len(bars))) # bar positions on x
x_names = list(hasvdj.index)
barWidth = 1
p1 = plt.bar(x_axis, hasvdj, color='#553864', edgecolor='white', width=barWidth)
p2 = plt.bar(x_axis, novdj, bottom=hasvdj, color='#81a75d', edgecolor='white', width=barWidth)
plt.legend((p1[0], p2[0]), ('Has VDJ', 'No VDJ'), ncol=2, bbox_to_anchor=(0, 1), loc='lower left', fontsize='small')
plt.grid(False)
plt.xticks(x_axis, x_names, rotation='vertical')
plt.subplots_adjust(bottom=0.4, left=0.2)
plt.xlabel(stats_dict['meta'])
plt.ylabel('Number of cells')
# Now we can plot cell numbers:
plot_cells(adata.uns['pyvdj']['stats'][meta])
Plot clonotype distribution for each sample:
sc.settings.set_figure_params(dpi=50)
clonotype_dist = adata.uns['pyvdj']['stats'][meta]['clonotype_dist']
for meta_item in adata.obs[meta].unique().tolist():
print(meta_item)
clonotype_dist[meta_item].sort_index().plot('bar', figsize=(7, 5))
plt.subplots_adjust(bottom=0.3, left=0.2)
plt.grid(False)
plt.title(meta_item)
plt.ylabel('Number of cells')
plt.show()
plt.close()
Here, vdj_clone_count = 0 means that the cell did not have VDJ data, and vdj_clone_count = 1 means that the clonotype has only 1 cell (clone).
Apparently, more clonotypes are in the control (normal) sample (C), than in GOF (P1). This could be due to either decreased diversity in GOF, or expansion of a few clonotypes in GOF, which makes it more likely to sample those cells.
In our data, P2 is a non-adult sample, and that could explain the lack of clonotype expansion.
We can plot in a slightly different way, with x-axis being a proper number line:
x_max = max(adata.obs['vdj_clone_count'])
x_max = int(x_max*1.05)
samples_plot = ['c8_C_normal_Untreated_B', 'c3_P1_gof_Untreated_B', 'c6_P2_gof_Untreated_B', ]
sc.settings.set_figure_params(dpi=70)
for s in samples_plot:
df = adata.obs.loc[adata.obs[meta] == s]
x1 = df['vdj_clone_count']
plt.hist(x1, bins=range(0, x_max), width = 2, color='g', label=s)
plt.title(s)
plt.xlabel('vdj_clone_count')
plt.ylabel('Number of cells')
plt.show()
plt.close()
The best approach is, however, to plot the histograms of the number of clonotypes vs clonotype size. First we save the counts for the R powerTCR package. This can be used for finding a threshold for expanded clonotypes, calculating diversity indices, comparing samples. For details, see their publication.
sc.settings.set_figure_params(dpi=70)
for s in adata.obs[meta].unique().tolist():
print(s)
if s not in cl_valcount_index:
continue
x_max = int(max(clonotype_valcounts[s]) * 1.05)
y_max = 100
# y_max = int(max(clonotype_valcounts[s].value_counts()) * 1.05)
fig, axs = plt.subplots(1, figsize=(5, 3.5), sharey=True)
plt.hist(clonotype_valcounts[s], bins=range(0, x_max), width = 1, color='g', label=s)
plt.ylim((0, y_max))
plt.title(s)
plt.xlabel('vdj_clone_count')
plt.ylabel('Number of clonotypes')
# plt.savefig('histogram_' + s + '.png')
plt.show()
plt.close()
We cut off the y axis at 100, so that the rest of the histogram is visible. Use the commented-out line to set y_max.
obs_col = adata.uns['pyvdj']['obs_col']
vdjdf = adata.uns['pyvdj']['df']
vdjdf.columns
vdjdf = vdjdf.loc[vdjdf['barcode_meta'].isin(adata.obs[obs_col])] # optional step: subset for cells in anndata
for s in vdjdf['sample'].unique():
sd = dict(vdjdf.loc[(vdjdf['sample'] == s)].groupby('clonotype_meta').size())
shannon_index = pyvdj.shannon(sd)
simpson_index = pyvdj.simpson(sd)
print(s)
print('The Shannon index for %s is %s.' % (s, shannon_index))
print('The Simpson index for %s is %s.' % (s, simpson_index))
print()
print()
Note that in the literature the inverse Simpson index (i.e. 1 / index ) is often (erroneously) referred to as the Simpson index.
The above can be expanded to calculate these indices separately for each metadata category:
vdjdf = vdjdf.loc[vdjdf['barcode_meta'].isin(adata.obs[obs_col])]
meta_cat = 'status'
for m in adata.obs[meta_cat].unique():
m_cells = adata.obs.loc[(adata.obs[meta_cat] == m)][obs_col]
vdjdf_m = vdjdf.loc[vdjdf['barcode_meta'].isin(m_cells)]
for s in vdjdf_m['sample'].unique():
sd = dict(vdjdf_m.loc[(vdjdf_m['sample'] == s)].groupby('clonotype_meta').size())
shannon_index = pyvdj.shannon(sd)
simpson_index = pyvdj.simpson(sd)
print(s)
print('The Shannon index for sample %s in category %s is %s.' % (s, m, shannon_index))
print('The Simpson index for sample %s in category %s is %s.' % (s, m, simpson_index))
print()
print()
However, this will still treat each 10x channel as a separate donor, and clonotypes across 10x channels (but from the same donor) will not be combined. (In this context, donor means an individual organism (human or animal).) This is addressed in the next section:
sample_donor = dict(zip(adata.uns['pyvdj']['df']['sample'].unique(), ['P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'C', 'C', ]))
sample_donor
adata = pyvdj.find_clones(adata, sample_donor)
The above lines combined clones within the same donor that were sequenced in different 10x channels (and thus have different clonotype IDs) into the same clonotype. This method uses the CDR3 nucleotide sequences. See:
adata.uns['pyvdj']['df']['donor_clonotype_meta']
adata.obs['vdj_donor_clonotype']
# Now that we combined clonotypes, we can calculate the clonotype diversity for each donor:
for s in adata.obs['donor'].unique():
sd = dict(adata.obs.loc[adata.obs['donor'] == s]['vdj_donor_clonotype'].value_counts())
shannon_index = pyvdj.shannon(sd)
simpson_index = pyvdj.simpson(sd)
print(s)
print('The Shannon index for %s is %s.' % (s, shannon_index))
print('The Simpson index for %s is %s.' % (s, simpson_index))
print()
print()
Another option is to calculate the Gini index for each sample (not shown here).
See definitions of public and private clonotypes.
Let's find public CDR3s! (i.e. CDR3s shared between donors or conditions)
There are two ways to approach this:
We can obtain all CDR3 sequences for each condition, then find shared sequences. We provide an example in a section further below.
A much stricter approach is to find CDR3-combinations (clonotypes) shared between conditions. This is more accurate, because the combination of two CDR3 sequences define the specificity of the TCR. One major disadvantage is that it will return very few positives. The implementation of this approach is imperfect here as it considers the set of CDR3s for each clonotype, instead of the pairs of CDR3s that make up a TCReceptor. For this method, we calculate statistics grouped by donor (as opposed to sample, i.e. 10x channel):
meta = 'donor'
adata = pyvdj.stats(adata, meta)
cdr3 = adata.uns['pyvdj']['stats'][meta]['cdr3']
set(cdr3['C']) & set(cdr3['P1']) & set(cdr3['P2'])
As we can see, there are no clonotypes (as defined by the set of CDR3 sequences) shared between all 3 donors.
set(cdr3['P1']) & set(cdr3['P2']) - set(cdr3['C'])
Each frozenset contains the CDR3s of a clonotype as determined by Cell Ranger. As we can see here and below, this strict approach returned clonotypes with only one CDR3 sequenced. In any case, we can search these amino acid sequences in VDJdb. This one is not present in the database, which may mean it is a sequence potentially private (specific) to this condition. However, these databases are relatively small and capture a small part of the potential diversity of these sequences.
set(cdr3['C']) & set(cdr3['P1'])
'CASSLYNEQFF' is a CMV-specific sequence according to VDJdb. Cytomegalovirus (CMV) is a herpes virus that infects the majority of humans, therefore it is not surprising that a receptor against it was found in the two adult samples.
'CASSSTSGAYNEQFF' is not found in the database, but variants of it (with 1 aa difference) are specific to herpes viruses (CMV, EBV) and influenzA.
These searches will be more useful as these databases are extended with new data from 10x VDJ sequencing experiments. Note that this kind of analysis is complicated by that a TCR likely recognises many different antigens.
Perhaps the most useful part of the data is the GOF-specific CDR3s (that are not found in control):
len(set(cdr3['P1']) - set(cdr3['C']))
A simple visualization of the above:
cdr3_codes_rev = adata.uns['pyvdj']['cdr3']['cdr3_codes_rev']
cdr3_coded = {}
for key, value in cdr3.items():
print(key)
cdr3_coded[key] = [cdr3_codes_rev[cdr3] for cdr3 in value]
cdr3_coded.keys()
We create a multi-indexed table that can be used for the plotting function:
contents = upsetplot.from_contents(cdr3_coded)
contents
upsetplot.UpSet(contents, subset_size='count')
See this for help on reading upsetplots.
We can use get_spec() with a list of clonotypes for each category, or obtain sequences directly from the data, filtered for cells as shown here:
meta = 'donor'
vdjdf = adata.uns['pyvdj']['df']
cdr3_simple_dict = {}
for m in adata.obs[meta].unique():
print(m)
cells = adata.obs.loc[adata.obs[meta] == m][obs_col].tolist()
cdr3_m = vdjdf.loc[vdjdf['barcode_meta'].isin(cells)]['cdr3']
cdr3_simple_dict[m] = set(cdr3_m)
cdr3_simple_dict.keys()
venn3([cdr3_simple_dict['P1'], cdr3_simple_dict['P2'], cdr3_simple_dict['C']], set_labels=['P1', 'P2', 'C'])
plt.show()
There are 18+17+17=52 CDR3s shared between healthy control and GOF donors. These CDR3s are not specific to GOF.
We can list the public CDR3s (i.e. the ones found in more than one donor) with the following code:
allcdr3 = set()
for k, v in cdr3_simple_dict.items():
allcdr3 = allcdr3 | v
len(allcdr3)
We have made a set containing all CDR3s and now we list which donor each CDR3 can be found in.
public_cdr3_table = pd.DataFrame(index=allcdr3)
for k, v in cdr3_simple_dict.items():
bool_cdr3 = [True if cdr3_val in v else False for cdr3_val in allcdr3]
public_cdr3_table[k] = bool_cdr3
public_cdr3_table['sum'] = public_cdr3_table.sum(axis=1) > 1
public_cdr3_table[public_cdr3_table['sum']].to_csv('public_cdr3_table.csv')
print('There are %d public CDR3s in the dataset.' % sum(public_cdr3_table['sum']))
To be thorough in finding public CDR3s, we also compare with public databases: download and unzip this file: https://github.com/antigenomics/vdjdb-db/releases/tag/2019-08-08, then
vdjdb_file = 'vdjdb-2019-08-08/vdjdb.txt'
vdjdb = pd.read_csv(vdjdb_file, sep='\t')
public_vdjdb = allcdr3 & set(vdjdb['cdr3'])
print('There are %i CDR3s in our data that were registered in VDJdb.' % len(public_vdjdb))
len(public_vdjdb & set(public_cdr3_table[public_cdr3_table['sum']].index))
We have found 63-27 = 36 public CDR3 sequences.
The above can be repeated for condition (status) instead of donor in order to find condition-specific CDR3s.
Unfortunately, a bug in AnnData causes the saved adata.uns['pyvdj']['df'] to turn into a numpy array upon loading. Therefore, it is recommended to save it separately, and load when necessary:
adata.uns['pyvdj']['df'].to_pickle(path='adata.pyvdj.df.pkl')
# And anndata:
#~ adata.write('data/anndata.h5ad')